Project Description

Objective

Use INE mobility data to explore the value of CORINE land cover data when modeling human mobility in metropolitan areas.

Questions

Can random forest regression with land cover variables improve predictions compared to a simple gravity model?

Given the large number of land cover variables (52, when including both origin and destination data) and their uncertain relationship to human mobility, random forests may help incorporate these data into a model without making assumptions about interactions (e.g. between origin and destination variables) or transformations (e.g. log scale). Additionally, random forest models indicate the relative importance of each variable, which may also be interesting.

Which model provides the best predictions for new metropolitan areas?

Given that INE provides mobility data that covers nearly all residents of Spain, a model of mobility is more useful if it can provide accurate predictions for areas it was not trained on. To explore this, I use mobility data for the 40 largest metropolitan areas in Spain and run each model (RF, Gravity) 40 times. Each time, one city serves as the test data while the other 39 cities make up the training set. I then compare the predictions to observed values using the root mean square error (RMSE) and the Common Part of Commuters (CPC), a metric that is common in the literature.
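The two metrics can be written compactly. The following is a minimal sketch in base R, assuming vectors of observed and predicted flows; the function names `rmse` and `cpc` and the toy numbers are illustrative and are not taken from the analysis code.

```r
# Root mean square error between observed and predicted flows.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Common Part of Commuters: twice the flow volume shared by the observed
# and predicted values, divided by their combined total. For non-negative
# inputs it ranges from 0 (no overlap) to 1 (identical flows).
cpc <- function(observed, predicted) {
  2 * sum(pmin(observed, predicted)) / (sum(observed) + sum(predicted))
}

# Toy example with made-up flows (not INE data).
observed  <- c(10, 20, 30)
predicted <- c(12, 18, 30)
rmse(observed, predicted)
cpc(observed, predicted)
```

Lower RMSE is better, while higher CPC is better, so the two metrics need not always agree on which model wins.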

When mapped, are there substantive differences between the model predictions?

I map flows for selected metropolitan areas to visually examine the differences between the model predictions.

Data

INE provided data on flows between mobility areas for all Wednesdays and Sundays in 2021. Here, I limit the data to 13 October 2021, the Wednesday on which Spain had the fewest daily Covid cases as well as few restrictions. Hopefully, mobility on this date best approximates “the new normal.” I include only flows within each metropolitan area, not flows between them.

Models

Gravity Model

Linear model of flows between mobility areas with the following independent variables: population of origin, population of destination, distance between destination and origin. All variables are on the log scale.
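Written out, this description corresponds to a log-linear specification like the one below. The symbols are introduced here for illustration only: $T_{ij}$ is the flow from origin $i$ to destination $j$, $P_i$ and $P_j$ are the origin and destination populations, $d_{ij}$ is the distance between them, and $\varepsilon_{ij}$ is the error term.

```latex
\log T_{ij} = \beta_0 + \beta_1 \log P_i + \beta_2 \log P_j + \beta_3 \log d_{ij} + \varepsilon_{ij}
```

Predictions on the flow scale are then obtained by exponentiating the fitted values.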

Random Forest (RF)

Random forest regression with 500 trees including the variables in the gravity model and the 52 land cover variables (total area of each land cover type). Modeled using the ranger package.

Results

Can random forest regression with land cover variables improve predictions compared to a simple gravity model?

library(ranger)
library(dplyr)

# Random forest on all predictors in `data` (ranger default: 500 trees)
rf <- ranger(dependent.variable.name = 'flujo', data = data, num.threads = 8)

# Gravity model: log-log linear regression
gravity <- lm(log(flujo) ~ log(pob_destino) + log(pob_residencia) + log(dist), data = data)

# In-sample predictions, errors, and the per-flow minimum used in the CPC
data <- data %>% mutate(flujo_pred_rf = predictions(predict(rf, .)),
                        flujo_pred_grav = exp(predict(gravity, .)),
                        errors_rf = flujo - flujo_pred_rf,
                        errors_grav = flujo - flujo_pred_grav,
                        min_rf = pmin(flujo, flujo_pred_rf),
                        min_grav = pmin(flujo, flujo_pred_grav))

tibble(Model = c("RF","Gravity"), 
       RMSE = c(sqrt(mean(data$errors_rf^2)),
                sqrt(mean(data$errors_grav^2))), 
       CPC = c(sum(2*data$min_rf)/(sum(data$flujo)+sum(data$flujo_pred_rf)),
               sum(2*data$min_grav)/(sum(data$flujo)+sum(data$flujo_pred_grav))))
## # A tibble: 2 × 3
##   Model    RMSE   CPC
##   <chr>   <dbl> <dbl>
## 1 RF       44.0 0.893
## 2 Gravity 151.  0.581

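When both models are fit to and evaluated on the full 40-city dataset, I find that the random forest model is superior in both RMSE (44.0 vs. 151) and CPC (0.893 vs. 0.581). Because these predictions are made on the same data used for fitting, the comparison favors the more flexible model. For random forest regressions, we can see the degree to which each variable contributes to the model: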

It appears that the origin variables ("_residencia") are frequently more important than the destination ones ("_destino"). The following table summarizes the mean importance of the two types of variables:

## # A tibble: 2 × 2
##   `Land Cover Variables` `Mean Importance`
##   <chr>                              <dbl>
## 1 Destination                    14744895.
## 2 Origin                         11298130.

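By mean importance, however, the table ranks the destination variables above the origin ones (about 14.7 million versus 11.3 million). Taken at face value, this suggests that human mobility in the 40 metropolitan areas depends more on “pull” factors than on “push” ones.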
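Importance scores like those summarized above are only available when the forest is fitted with an importance mode, and the `ranger()` call shown earlier does not set one, so the mode used here is not stated. The following is a minimal sketch of one way such a summary could be produced, assuming impurity importance and simulated data; the column names in `toy` are hypothetical stand-ins that only mimic the "_residencia" and "_destino" suffixes.

```r
library(ranger)

set.seed(1)
n <- 200
# Simulated stand-in data; column names are hypothetical.
toy <- data.frame(
  urban_residencia  = runif(n),
  forest_residencia = runif(n),
  urban_destino     = runif(n),
  forest_destino    = runif(n)
)
toy$flujo <- 50 * toy$urban_residencia + 20 * toy$urban_destino + rnorm(n)

# importance = "impurity" is an assumption; "permutation" is the other
# common choice and produces values on a different scale.
rf_toy <- ranger(dependent.variable.name = "flujo", data = toy,
                 num.trees = 500, importance = "impurity", seed = 1)

# Named numeric vector with one entry per predictor.
imp <- rf_toy$variable.importance

# Mean importance by variable type, based on the column-name suffix.
type <- ifelse(grepl("_residencia$", names(imp)), "Origin", "Destination")
mean_imp <- tapply(imp, type, mean)
mean_imp
```

Impurity importance is not comparable across datasets or importance modes, so the absolute values matter less than the ordering within a single fitted model.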

Which model provides the best predictions for new metropolitan areas?

The following summarizes the 40 model runs in which each city, sequentially, serves as the test data while the other 39 serve as the training data.
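The hold-one-city-out procedure can be sketched as a simple loop. This is a minimal illustration in base R, assuming a flow table with a `city` column and using simulated data for three made-up cities; only the gravity model is shown, and the data-generating coefficients are arbitrary.

```r
set.seed(42)
# Simulated stand-in for the flow table (not INE data).
n <- 300
flows <- data.frame(
  city           = rep(c("A", "B", "C"), each = n / 3),
  pob_residencia = round(runif(n, 1e3, 5e4)),
  pob_destino    = round(runif(n, 1e3, 5e4)),
  dist           = runif(n, 0.5, 30)
)
flows$flujo <- exp(-8 + 0.8 * log(flows$pob_residencia) +
                   0.8 * log(flows$pob_destino) -
                   1.2 * log(flows$dist) + rnorm(n, sd = 0.3))

# Hold out each city in turn, fit on the rest, and score the held-out city.
results <- do.call(rbind, lapply(unique(flows$city), function(test_city) {
  train <- flows[flows$city != test_city, ]
  test  <- flows[flows$city == test_city, ]
  fit   <- lm(log(flujo) ~ log(pob_destino) + log(pob_residencia) + log(dist),
              data = train)
  pred  <- exp(predict(fit, newdata = test))  # back-transform to flow units
  data.frame(
    city    = test_city,
    rmse_lm = sqrt(mean((test$flujo - pred)^2)),
    cpc_lm  = 2 * sum(pmin(test$flujo, pred)) / (sum(test$flujo) + sum(pred))
  )
}))
results
```

The same loop would apply to the random forest by swapping the `lm()` fit and its prediction step for the corresponding `ranger()` calls.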

## Cities with most accurate RF predictions:
## # A tibble: 6 × 7
##   city               rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##   <chr>                <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>     
## 1 Alicante/Alacant      97.4   142.  RF           0.758  0.585 RF        
## 2 Lleida                74.2    99.0 RF           0.744  0.614 RF        
## 3 Girona                82.5   127.  RF           0.737  0.581 RF        
## 4 Gijón                 93.2   120.  RF           0.736  0.637 RF        
## 5 Castellón/Castelló   134.    204.  RF           0.735  0.471 RF        
## 6 León                 105.    140.  RF           0.731  0.563 RF
## Cities with least accurate RF predictions:
## # A tibble: 6 × 7
##   city                   rmse_rf rmse_lm better_rmse cpc_rf cpc_lm better_cpc
##   <chr>                    <dbl>   <dbl> <chr>        <dbl>  <dbl> <chr>     
## 1 Madrid                    148.    88.7 Gravity      0.511  0.649 Gravity   
## 2 Vitoria/Gasteiz           589.   747.  RF           0.534  0.237 RF        
## 3 Badajoz                   230.   273.  RF           0.595  0.354 RF        
## 4 Jaén                      194.   237.  RF           0.614  0.416 RF        
## 5 Donostia-San Sebastián    262.   305.  RF           0.615  0.448 RF        
## 6 Ourense                   205.   258.  RF           0.623  0.397 RF
## Plot of CPC for each test city:

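The random forest model outperforms the gravity model for every city except Madrid, where the gravity model has both the lower RMSE (88.7 vs. 148) and the higher CPC (0.649 vs. 0.511).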

When mapped, are there substantive difference between the model predictions?

The maps below show the log flows (in both directions combined) for the mobility areas in the selected metropolitan area. The left map shows the observed flows, the center map shows the gravity model’s predictions, and the right map shows the random forest model’s predictions. The color scale is the same for each.
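Combining flows in both directions before taking the log could be done along the following lines. This is a sketch under assumptions: it sums the two directions and then takes the log, and the `origin` and `destination` column names and the flow values are made up for illustration.

```r
# Made-up directed flows between three mobility areas.
directed <- data.frame(
  origin      = c("a", "b", "a", "c", "b"),
  destination = c("b", "a", "c", "a", "c"),
  flujo       = c(100, 40, 10, 5, 7)
)

# Build an order-independent key so a->b and b->a fall in the same group.
directed$area_1 <- pmin(directed$origin, directed$destination)
directed$area_2 <- pmax(directed$origin, directed$destination)

# Sum both directions, then take the log for mapping.
combined <- aggregate(flujo ~ area_1 + area_2, data = directed, FUN = sum)
combined$log_flujo <- log(combined$flujo)
combined
```

Applying the same steps to the observed flows and to each model's predictions keeps the three maps directly comparable on a shared color scale.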